Goto

Collaborating Authors

 Kelowna



MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training

Zhao, Zixiao, Fard, Fatemeh H., Wu, Jie JW

arXiv.org Artificial Intelligence

The reliable application of deep learning models to software engineering tasks hinges on high-quality training data. Yet, large-scale repositories inevitably introduce noisy or mislabeled examples that degrade both accuracy and robustness. While Noise Label Learning (NLL) has been extensively studied in other fields, there are a few works that investigate NLL in Software Engineering (SE) and Large Language Models (LLMs) for SE tasks. In this work, we propose MANTRA, a Multi-stage Adaptive Noise TReAtment framework that embeds noise diagnosis and mitigation directly into the fine-tuning process of code-Pretrained Language Models (PTM) and code-LLMs. We first investigate the effect of noise at varying levels on convergence and loss trajectories of the models. Then we apply an adaptive dropout strategy guided by per-sample loss dynamics and Gaussian Mixture Model clustering to exclude persistently noisy points while preserving clean data. Applying to code summarization and commit intent classification, our experiments reveal that some LLMs are more sensitive to noise than others. However, with MANTRA, the performance of all models in both tasks is improved. MANTRA enables researchers and practitioners to reduce the impact of errors introduced by the dataset in training, saves time in data cleaning and processing, while maximizing the effect of fine-tuning.


Fourier-Invertible Neural Encoder (FINE) for Homogeneous Flows

Ouyang, Anqiao, Ke, Hongyi, Wang, Qi

arXiv.org Artificial Intelligence

We present the Fourier-Invertible Neural Encoder (FINE), a compact and interpretable architecture for dimension reduction in translation-equivariant datasets. FINE integrates reversible filters and monotonic activation functions with a Fourier truncation bottleneck, achieving information-preserving compression that respects translational symmetry. This design offers a new perspective on symmetry-aware learning, linking spectral truncation to group-equivariant representations. The proposed FINE architecture is tested on one-dimensional nonlinear wave interaction, one-dimensional Kuramoto-Sivashinsky turbulence dataset, and a two-dimensional turbulence dataset. FINE achieves an overall 4.9-9.1 times lower reconstruction error than convolutional autoencoders while using only 13-21% of their parameters. The results highlight FINE's effectiveness in representing complex physical systems with minimal dimension in the latent space. The proposed framework provides a principled framework for interpretable, low-parameter, and symmetry-preserving dimensional reduction, bridging the gap between Fourier representations and modern neural architectures for scientific and physics-informed learning.


FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking

Sharma, Rishab, Saberi, Iman, Alipour, Elham, Wu, Jie JW, Fard, Fatemeh

arXiv.org Artificial Intelligence

Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (20x), such as Mixtral-8x22B and Command R+. On external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).



WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada

Sherritt, Braeden, Nejadgholi, Isar, Aivaliotis, Efstratios, Mslmani, Khaled, Amini, Marzieh

arXiv.org Artificial Intelligence

Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although existing in current datasets, is currently underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches 84.48% f-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.


Analysis of AdvFusion: Adapter-based Multilingual Learning for Code Large Language Models

Esmaeili, Amirreza, Seddik, Fahd, Ji, Yongyi, Fard, Fatemeh, Chen, Fuxiang

arXiv.org Artificial Intelligence

Programming languages can benefit from one another by utilizing a language model for software engineering tasks. Full fine-tuning and Parameter Efficient Fine-Tuning (PEFT) of Code Language Models (Code-LMs) has been explored for multilingual knowledge transfer. AdapterFusion is a PEFT architecture that aims to enhance task performance by leveraging information from multiple programming languages, but primarily focuses on the target programming language. In our previous work, we proposed AdvFusion, a novel PEFT-based approach that effectively learns from other programming languages before adapting to the target task. Though previous experiments showed that AdvFusion outperformed AdapterFusion and LoRA, it was applied on pre-trained Code-LMs and was limited to only two tasks, code summarization and method name prediction. In this study, we expanded our work and investigated AdvFusion on Code Large Language Models (Code-LLMs), considering three new tasks: code generation, code translation, and commit message generation. We observed that different Code-LLMs/tasks exhibit different characteristics. In code generation, AdvFusion outperformed AdapterFusion but not other PEFT methods (LoRA, Compacter, and TaskAdapter). In commit message generation, AdapterFusion performed better than AdvFusion, and contrary to code generation, we found that the other PEFT methods do not have better performance. In code translation, AdvFusion performed worse than AdapterFusion overall, with the performance gap marginally widening as the model size increases. However, consistent with code generation, other PEFT methods showed better performance.


CodeWatcher: IDE Telemetry Data Extraction Tool for Understanding Coding Interactions with LLMs

Basha, Manaal, Ribeiro, Aimeê M., Javahar, Jeena, de Souza, Cleidson R. B., Rodríguez-Pérez, Gema

arXiv.org Artificial Intelligence

Understanding how developers interact with code generation tools (CGTs) requires detailed, real-time data on programming behavior which is often difficult to collect without disrupting workflow. We present \textit{CodeWatcher}, a lightweight, unobtrusive client-server system designed to capture fine-grained interaction events from within the Visual Studio Code (VS Code) editor. \textit{CodeWatcher} logs semantically meaningful events such as insertions made by CGTs, deletions, copy-paste actions, and focus shifts, enabling continuous monitoring of developer activity without modifying user workflows. The system comprises a VS Code plugin, a Python-based RESTful API, and a MongoDB backend, all containerized for scalability and ease of deployment. By structuring and timestamping each event, \textit{CodeWatcher} enables post-hoc reconstruction of coding sessions and facilitates rich behavioral analyses, including how and when CGTs are used during development. This infrastructure is crucial for supporting research on responsible AI, developer productivity, and the human-centered evaluation of CGTs. Please find the demo, diagrams, and tool here: https://osf.io/j2kru/overview.